
[feat] Support Qwen3_5 Training #143

Merged
KemingWu merged 7 commits into transformer_5.0 from feat/qwen3_5 on Mar 10, 2026

Conversation

KemingWu (Collaborator) commented on Mar 9, 2026

Motivation

Modifications

Commit Message Convention

Please follow our standardized commit message format:

  • [feat] - New features or functionality
  • [fix] - Bug fixes
  • [docs] - Documentation changes only
  • [style] - Code style changes (formatting, missing semicolons, etc.)
  • [refactor] - Code refactoring without changing functionality
  • [perf] - Performance improvements
  • [test] - Adding or updating tests
  • [chore] - Maintenance tasks, dependency updates, etc.
  • [ci] - CI/CD configuration changes

Examples:

  • [feat] add qwen omni iterable dataset support
  • [fix] resolve bagel model configuration error
  • [docs] update training guide with YAML examples

See CONTRIBUTING.md for more details.

CI/CD Checks

Your PR will automatically run the following checks:

  • Linting: Code formatting with black (line-length=120) and import sorting with isort
  • Run pre-commit run --all-files locally to verify before pushing

Checklist

  • Follow commit message convention (see above)
  • Run pre-commit run --all-files and ensure all checks pass
  • Format your code with black (line-length=120) and isort
  • Add unit tests for new functionality
  • Update documentation as needed, including docstrings or example tutorials
  • Ensure all CI/CD checks pass

@KemingWu KemingWu requested review from Luodian and kcz358 March 9, 2026 14:00
Collaborator:
This part looks unnecessary; we can use the vision iterable directly.

Collaborator:

Same for this one. There seems to be no need to override the load-from-JSON method?

kcz358 (Collaborator) left a comment:

I checked the HF and transformers repos, and it seems Qwen3.5 uses the exact same logic as Qwen3. So I think all the data processing classes can reuse the Qwen3 processor and dataset.

@KemingWu KemingWu requested a review from kcz358 March 10, 2026 03:21
Collaborator:

One thing to note here: Qwen3.5 uses hybrid attention (linear + full). Can we just use the FLOPs function for Qwen2?

kcz358 (Collaborator) left a comment:

LGTM. I think the FLOPs estimate is a bit inaccurate. If we aren't sure what the FLOPs function for the gated delta net should be, maybe we can leave it empty, or wait and see whether we can copy one from verl etc. :)

@KemingWu KemingWu merged commit b585f4a into transformer_5.0 Mar 10, 2026
3 checks passed
@KemingWu KemingWu deleted the feat/qwen3_5 branch March 10, 2026 08:10
kcz358 added a commit that referenced this pull request Mar 12, 2026
* feat(models): add transformers 5.0 compatibility

Conditionally import models incompatible with transformers >= 5.0:
- dream_dllm, qwen3_dllm, llada_dllm require transformers < 5.0
- llava_onevision1_5 requires transformers < 5.0
- Dynamically update __all__ based on transformers version
- Prevents ImportError when using transformers 5.0+
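The version gating described above can be sketched as a small helper. This is a hypothetical sketch: the function name `gated_all` and the always-available model list are illustrative; the gated model names follow the commit message.

```python
def gated_all(transformers_version: str) -> list:
    """Return the __all__ entries appropriate for a given transformers
    version string; models incompatible with >= 5.0 are gated out.
    Hypothetical helper illustrating the conditional-import approach."""
    major = int(transformers_version.split(".")[0])
    names = ["qwen3_vl"]  # always-available models (illustrative)
    if major < 5:
        # these models rely on APIs removed in transformers 5.0
        names += ["dream_dllm", "qwen3_dllm", "llada_dllm", "llava_onevision1_5"]
    return names
```

In the real module, `__all__` would be built this way at import time from `transformers.__version__`, so importing the package never touches the incompatible submodules.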

* fix(train): add group_by_length for backward compatibility

Add group_by_length parameter to TrainingArguments to maintain
compatibility with existing training configurations.

* feat(deps): allow transformers >= 4.57.1

Update transformers dependency from exact version to minimum version
to support transformers 5.0+ while maintaining backward compatibility.

* style: auto-fix lint (black + isort)

* refactor(processor): replace additional_special_tokens with all_special_tokens

Use all_special_tokens for transformers >= 5.0 compatibility while
maintaining backward compatibility with transformers < 5.0.

Changes:
- Add special_tokens property to all processor classes
- Use all_special_tokens if available (transformers >= 5.0)
- Fall back to additional_special_tokens (transformers < 5.0)
- Add <|im_start|> and <|im_end|> tokens to special_tokens list
- Cache special_tokens as instance attribute for performance

Affected processors:
- AeroDataProcessor (base class)
- BaseQwen2_5_DataProcessor (inherits from AeroDataProcessor)
- Qwen2VLDataProcessor
- Qwen2DataProcessor
- LLaVADataProcessor
- LLaVAVideoDataProcessor (inherits from LLaVADataProcessor)
- NanovlmDataProcessor
- Qwen3_VLDataProcessor (inherits from BaseQwen2_5_DataProcessor)
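The fallback logic above can be sketched as a standalone function. This is a hypothetical sketch mirroring the described refactor; the function name and the duck-typing on tokenizer attributes are assumptions.

```python
def get_special_tokens(tokenizer):
    """Prefer all_special_tokens (transformers >= 5.0); fall back to
    additional_special_tokens (transformers < 5.0). Hypothetical helper
    mirroring the special_tokens property described above."""
    if hasattr(tokenizer, "all_special_tokens"):
        tokens = list(tokenizer.all_special_tokens)
    else:
        tokens = list(getattr(tokenizer, "additional_special_tokens", []))
    # ensure the chat-template delimiters are always present
    for tok in ("<|im_start|>", "<|im_end|>"):
        if tok not in tokens:
            tokens.append(tok)
    return tokens
```

In the processors, the result would be cached as an instance attribute so the version check runs only once.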

* style: auto-fix lint (black + isort)

* refactor(processor): unify apply_chat_template usage

Use processor.apply_chat_template with tokenize=True consistently
across all processors instead of mixing with processor.tokenizer calls.

Changes:
- aero_processor: use processor.apply_chat_template(tokenize=True)[0]
- base_qwen2_5_processor: use processor.apply_chat_template(tokenize=True)[0]
- qwen2_vl_processor: use processor.apply_chat_template(tokenize=True)
- qwen3_vl_processor: use processor.apply_chat_template(tokenize=True)[0]

This ensures all processors return token IDs directly during data
preparation, improving consistency and reducing confusion.

* feat(models): add common_ops for transformer-agnostic rope index

Extract rope index calculation functions into common_ops/rope.py to
ensure consistent behavior across transformers versions.

Changes:
- Add common_ops/rope.py with qwen2_5_vl_rope_index and qwen3_vl_get_rope_index
- Update qwen2_5_vl_ops.py to use qwen2_5_vl_rope_index
- Update qwen3_vl_ops.py to use qwen3_vl_get_rope_index
- Update qwen3_vl_moe_ops.py to use qwen3_vl_get_rope_index

This ensures rope index calculations remain stable even when transformers
internal implementations change.

* fix(utils): add B200/B300 GPU FLOPS support

Add NVIDIA B200/B300 GPU FLOPS (2.25e15) to get_device_flops()
to fix MFU calculation returning 0 on B200 GPUs.

Previously, unknown GPU types returned inf FLOPS, causing MFU
to always be 0.
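A lookup of this shape could look as follows. Hypothetical sketch: the B200/B300 value (2.25e15) follows the commit message, the other entry is illustrative, and the function name matches the one mentioned above.

```python
# Peak FLOPS per GPU family; B200/B300 value from the commit message,
# H100 entry is illustrative only.
_PEAK_FLOPS = {
    "B200": 2.25e15,
    "B300": 2.25e15,
    "H100": 9.89e14,  # illustrative
}

def get_device_flops(device_name: str) -> float:
    """Return peak FLOPS for a device name; unknown devices return inf,
    which makes the MFU ratio collapse to 0 rather than raising."""
    for key, flops in _PEAK_FLOPS.items():
        if key in device_name:
            return flops
    return float("inf")
```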

* Lint

* fix(models): qwen2_5_vl transformers 5.0 compatibility

- Fix vision_model variable reference in liger kernel patch
- Support nested text_config in lce_forward
- Handle rope_scaling/rope_parameters for transformers 5.0+
- Add qwen2_5_vl to FlopsCounter model type mapping

* refactor(processor): use DataUtilities.apply_chat_template for transformers 5.0 compatibility

- Add apply_chat_template utility method to DataUtilities
- Handles dict-like return values (BatchEncoding) with use_key param
- Handles nested list wrapping from some processors
- Update all processors to use unified method
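The normalization described above can be sketched as follows. Hypothetical sketch: the wrapper signature and the `use_key` parameter name follow the commit message, but the exact handling in DataUtilities is assumed.

```python
def apply_chat_template(processor, messages, use_key="input_ids", **kwargs):
    """Version-agnostic wrapper: some processors return dict-like
    BatchEncoding objects, others return (possibly nested) token-id lists.
    Hypothetical sketch of the unified DataUtilities method."""
    out = processor.apply_chat_template(messages, tokenize=True, **kwargs)
    if hasattr(out, "keys"):
        # dict-like (BatchEncoding): pull out the requested field
        out = out[use_key]
    if out and isinstance(out[0], list):
        # some processors wrap the ids in an extra batch dimension
        out = out[0]
    return out
```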

* feat(launch): add filter_training_args for transformers 5.0 compatibility

Filter unsupported TrainingArguments parameters by inspecting
transformers.TrainingArguments.__init__ signature, avoiding errors
from deprecated or removed parameters in newer versions.
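The signature-inspection approach can be sketched like this. Hypothetical sketch: the function name follows the commit message, but passing the target class explicitly (rather than hardcoding TrainingArguments) is an assumption made to keep the example self-contained.

```python
import inspect

def filter_training_args(args_dict, cls):
    """Drop keys not accepted by cls.__init__ (e.g. TrainingArguments in a
    newer transformers release). Hypothetical sketch of the filtering."""
    params = inspect.signature(cls.__init__).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(args_dict)  # __init__ takes **kwargs: accept everything
    return {k: v for k, v in args_dict.items() if k in params}
```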

* fix(models): add parse_visual_output for transformers 5.0 compatibility

Visual model methods (get_image_features, get_video_features, visual())
may return tuples OR dataclass objects (BaseModelOutputWithPooling,
BaseModelOutputWithDeepstackFeatures) in transformers 5.0+.

Add parse_visual_output() to transparently handle both return types.
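The dual-return handling can be sketched as follows. Hypothetical sketch: the function name follows the commit message, and the `last_hidden_state` attribute is the standard field on transformers output dataclasses; the exact logic in the repo may differ.

```python
def parse_visual_output(output):
    """Normalize visual-encoder return values: transformers 5.0+ may return
    a dataclass (e.g. BaseModelOutputWithPooling) where older versions
    returned a tuple. Hypothetical sketch of the normalization."""
    if isinstance(output, tuple):
        return output[0]  # legacy tuple return: features come first
    if hasattr(output, "last_hidden_state"):
        return output.last_hidden_state  # dataclass return (>= 5.0)
    return output  # already a plain tensor/list
```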

* [feat] Support Qwen3_5 Training (#143)

* [feat] Support Qwen3_5 Training

* style: auto-fix lint (black + isort)

* [feat] Support Qwen3.5 Training

* optimize qwen3.5 dataset process logic

* optimize qwen3.5 dataset process logic

* flop function leave empty

---------

Co-authored-by: charlesswu <charlesswu@tencent.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* fix(processor): remove duplicate special_tokens property in qwen2_vl_processor

* fix(models): remove duplicate .to() calls in qwen2_5_omni_liger

* fix(models): define input_ids_rmpad in inputs_embeds branch to avoid NameError

* refactor(models): extract parse_visual_output to common_ops/visual.py

* refactor(processor): extract special_tokens logic to DataUtilities.get_special_tokens

* style: auto-fix lint (black + isort)

* docs: add Transformers 5.0 migration guide

Add comprehensive migration guide for transformers 5.0 compatibility.
Includes compatibility matrix, installation instructions, and troubleshooting
for Qwen3.5 (requires >= 5.3.0) and legacy models (requires < 5.0.0).

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: wukeming <108406625+KemingWu@users.noreply.github.com>
Co-authored-by: charlesswu <charlesswu@tencent.com>
Co-authored-by: mwxely <yang0756@e.ntu.edu.sg>